feat: allow inserting subschemas #3041

Merged: wjones127 merged 19 commits into lancedb:main from feat/insert-subschema on Nov 7, 2024

wjones127 (Contributor) commented Oct 24, 2024

Allow inserting data that contains only a subset of the schema's columns, provided the missing columns are nullable. Missing columns are filled with null values. This also works for nested fields.

For example:

```python
import lance
import pyarrow as pa

# Initial write: some rows omit `vec` or one of the `metadata` subfields.
data = [
    {"vec": [1.0, 2.0, 3.0], "metadata": {"x": 1, "y": 2}},
    {"metadata": {"x": 3}},
    {"vec": [2.0, 3.0, 5.0], "metadata": {"y": 4}},
]
table = pa.Table.from_pylist(data)
ds = lance.write_dataset(table, "./demo")
ds.to_table().to_pandas()
```
```
               vec               metadata
0  [1.0, 2.0, 3.0]   {'x': 1.0, 'y': 2.0}
1             None  {'x': 3.0, 'y': None}
2  [2.0, 3.0, 5.0]  {'x': None, 'y': 4.0}
```
```python
# Append a table whose schema lacks `vec` and `metadata.x`.
# Both are nullable, so they are filled with nulls.
new_data = [
    {"metadata": {"y": 6}},
]
new_table = pa.Table.from_pylist(new_data)
ds = lance.write_dataset(new_table, "./demo", mode="append")
ds.to_table().to_pandas()
```
```
               vec               metadata
0  [1.0, 2.0, 3.0]   {'x': 1.0, 'y': 2.0}
1             None  {'x': 3.0, 'y': None}
2  [2.0, 3.0, 5.0]  {'x': None, 'y': 4.0}
3             None    {'x': None, 'y': 6}
```
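The fill rule described above can be sketched in plain Python, without Lance or PyArrow. This is only an illustration of the idea, not how Lance implements it: the `SCHEMA` layout (leaf name mapped to a nullable flag, struct name mapped to a nested dict) and the `fill_missing` helper are hypothetical, and a missing struct is simplified to a struct of nulls.

```python
# Hypothetical schema description: a leaf maps to its "nullable" flag,
# a struct maps to a nested dict of its children.
SCHEMA = {
    "vec": True,
    "metadata": {"x": True, "y": True},
}


def fill_missing(row: dict, schema: dict) -> dict:
    """Return a copy of `row` where every field missing from it is None."""
    filled = {}
    for name, spec in schema.items():
        value = row.get(name)
        if isinstance(spec, dict):
            # Struct field: recurse so missing subfields are filled too.
            filled[name] = fill_missing(value or {}, spec)
        elif value is None and not spec:
            # A missing column is only acceptable when it is nullable.
            raise ValueError(f"non-nullable field {name!r} is missing")
        else:
            filled[name] = value
    return filled


print(fill_missing({"metadata": {"y": 6}}, SCHEMA))
# {'vec': None, 'metadata': {'x': None, 'y': 6}}
```

The printed row has the same shape as row 3 in the output above: `vec` and `metadata.x` were not supplied, so both come back as null.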

Closes #3016

github-actions bot added the enhancement label Oct 24, 2024

codecov-commenter commented Oct 24, 2024

Codecov Report

Attention: Patch coverage is 93.02987% with 49 lines in your changes missing coverage. Please review.

Project coverage is 77.06%. Comparing base (2d3dd67) to head (c897c09).
Report is 4 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| rust/lance-core/src/datatypes/schema.rs | 92.77% | 16 Missing and 3 partials ⚠️ |
| rust/lance/src/dataset/fragment.rs | 79.76% | 16 Missing and 1 partial ⚠️ |
| rust/lance/src/dataset.rs | 96.02% | 3 Missing and 9 partials ⚠️ |
| rust/lance/src/io/commit.rs | 66.66% | 0 Missing and 1 partial ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3041      +/-   ##
==========================================
- Coverage   77.16%   77.06%   -0.10%     
==========================================
  Files         240      240              
  Lines       79764    80417     +653     
  Branches    79764    80417     +653     
==========================================
+ Hits        61548    61975     +427     
- Misses      15071    15265     +194     
- Partials     3145     3177      +32     
| Flag | Coverage Δ |
| --- | --- |
| unittests | 77.06% <93.02%> (-0.10%) ⬇️ |
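As a sanity check on the report, the figures can be recomputed from the numbers it lists. This is a small sketch using only values quoted above; the assumption that project coverage is hits divided by hits plus misses plus partials is inferred from the diff table, where those three rows sum to the `Lines` row.

```python
# (missing, partial) line counts per file, copied from the report above.
uncovered = {
    "rust/lance-core/src/datatypes/schema.rs": (16, 3),
    "rust/lance/src/dataset/fragment.rs": (16, 1),
    "rust/lance/src/dataset.rs": (3, 9),
    "rust/lance/src/io/commit.rs": (0, 1),
}
total_uncovered = sum(missing + partial for missing, partial in uncovered.values())
print(total_uncovered)  # 49, matching "49 lines in your changes missing coverage"


def coverage(hits: int, misses: int, partials: int) -> float:
    """Coverage percentage, assuming lines = hits + misses + partials."""
    return 100 * hits / (hits + misses + partials)


base = coverage(61548, 15071, 3145)  # main
head = coverage(61975, 15265, 3177)  # this PR
print(base, head, head - base)
```

The recomputed values agree with the reported 77.16% and 77.06% to within 0.01 percentage points, and the difference rounds to -0.10.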

Flags with carried forward coverage won't be shown.

wjones127 force-pushed the feat/insert-subschema branch from ef8c62d to 24c27e7 on October 30, 2024 20:44
github-actions bot added the python label Nov 1, 2024
wjones127 force-pushed the feat/insert-subschema branch from 3c1ac51 to e239a69 on November 5, 2024 00:11
wjones127 marked this pull request as ready for review November 5, 2024 02:59
westonpace (Contributor) left a comment

Great work. Thanks for cleaning up all that schema / field comparison logic too!

rust/lance-core/src/datatypes/field.rs (outdated, resolved)
@@ -476,13 +432,13 @@ impl Field {
     ///
     /// If the ids are `[2]`, then this will include the parent `0` and the
     /// child `3`.
-    pub(crate) fn project_by_ids(&self, ids: &[i32]) -> Option<Self> {
+    pub(crate) fn project_by_ids(&self, ids: &[i32], include_all_children: bool) -> Option<Self> {
         let children = self
Contributor

Super minor nit: I'm guessing the optimizer catches this but it might be faster to only calculate children if we need it...

pub(crate) fn project_by_ids(&self, ids: &[i32], include_all_children: bool) -> Option<Self> {
    if !ids.contains(&self.id) {
        return None;
    }

Contributor Author

We actually don't want to return early here: even if a field isn't selected itself, we still need to check whether any of its children are.
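To illustrate why the early return would be wrong, here is a minimal Python sketch of a recursive projection by field id. The `Field` class and this `project_by_ids` are simplified stand-ins, not the Rust code in lance-core, and the meaning given to `include_all_children` (keep the whole subtree of a selected field) is an assumption based on the doc comment in the diff above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Field:
    id: int
    name: str
    children: List["Field"] = field(default_factory=list)


def project_by_ids(f: Field, ids: List[int], include_all_children: bool) -> Optional[Field]:
    selected = f.id in ids
    if selected and include_all_children:
        # Assumed semantics: a selected field keeps its entire subtree.
        return f
    # No early return for unselected fields: a descendant may still be selected.
    kept = [
        projected
        for projected in (project_by_ids(c, ids, include_all_children) for c in f.children)
        if projected is not None
    ]
    if selected or kept:
        return Field(f.id, f.name, kept)
    return None


# Field 0 has children 1 and 2; field 2 has child 3.
root = Field(0, "a", [Field(1, "b"), Field(2, "c", [Field(3, "d")])])

projected = project_by_ids(root, [2], include_all_children=True)
print(projected)  # parent 0 is kept, with child 2 and grandchild 3
```

With `ids = [2]`, the root (id 0) is not in `ids`, so a check like `if f.id not in ids: return None` at the top would discard the whole tree before field 2 was ever visited.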

rust/lance-core/src/datatypes/field.rs (resolved)
Comment on lines 813 to 825
// Check if there are any fields that are not in any data files
let field_ids_in_files = opened_files
    .iter()
    .flat_map(|r| r.projection().fields_pre_order().map(|f| f.id))
    .filter(|id| *id >= 0)
    .collect::<HashSet<_>>();
let mut missing_fields = projection.field_ids();
missing_fields.retain(|f| !field_ids_in_files.contains(f) && *f >= 0);
if !missing_fields.is_empty() {
    let missing_projection = projection.project_by_ids(&missing_fields, true);
    let null_reader = NullReader::new(Arc::new(missing_projection), opened_files[0].len());
    opened_files.push(Box::new(null_reader));
}
Contributor

This is neat!
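The same idea can be sketched in Python: collect the field ids present in any data file, treat every projected id that appears in none of them as missing, and serve those columns as all-null. The data model here (a data file as a dict from field id to a list of values) and the `read_with_null_fill` name are hypothetical simplifications, not Lance's reader API.

```python
from typing import Dict, List, Optional

Column = List[Optional[float]]


def read_with_null_fill(
    projection_ids: List[int], data_files: List[Dict[int, Column]]
) -> Dict[int, Column]:
    # Row count is taken from the first file, like `opened_files[0].len()` above.
    num_rows = len(next(iter(data_files[0].values())))
    # Field ids that exist in at least one data file (negative ids are skipped).
    ids_in_files = {fid for f in data_files for fid in f if fid >= 0}
    # Projected fields that no file contains.
    missing = [fid for fid in projection_ids if fid >= 0 and fid not in ids_in_files]

    columns: Dict[int, Column] = {}
    for f in data_files:
        for fid, values in f.items():
            if fid in projection_ids:
                columns[fid] = values
    # Stand-in for the null reader: one all-null column per missing field.
    for fid in missing:
        columns[fid] = [None] * num_rows
    return columns


# A fragment written before field 2 was part of the data.
old_file = {0: [1.0, 2.0], 1: [3.0, 4.0]}
result = read_with_null_fill([0, 1, 2], [old_file])
print(result)  # {0: [1.0, 2.0], 1: [3.0, 4.0], 2: [None, None]}
```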

rust/lance/src/dataset/write.rs (outdated, resolved)
wjones127 force-pushed the feat/insert-subschema branch from c53fb4a to f916af7 on November 6, 2024 23:20
wjones127 requested a review from westonpace November 7, 2024 18:13
wjones127 merged commit 6d24d84 into lancedb:main Nov 7, 2024
26 checks passed
wjones127 added a commit that referenced this pull request Dec 18, 2024
In #2639 we added support for
*updating* subcolumns. In #3041 we
added support for *inserting* subcolumns. This PR adds support for
upserting them (or doing insert-if-not-exists).

Closes #2904

## Example

```python
import pyarrow as pa
import lance

table = pa.table({
    "id": range(3),
    "a": [1.0, 2.0, 3.0],
    "c": ["x", "x", "x"]
})
dataset = lance.write_dataset(table, "example")

# Upsert: when_matched_update_all + when_not_matched_insert_all
new_data = pa.table({
    "id": [2, 3],
    "c": ["y", "y"]
})
(
    dataset
    .merge_insert(on="id")
    .when_matched_update_all()
    .when_not_matched_insert_all()
    .execute(new_data)
)
dataset.to_table().to_pandas()
```
```
   id    a  c
0   0  1.0  x
1   1  2.0  x
2   2  3.0  y
3   3  NaN  y
```

```python
# Insert-if-not-exists: when_not_matched_insert_all
new_data = pa.table({
    "id": [3, 4],
    "c": ["z", "z"]
})
(
    dataset
    .merge_insert(on="id")
    .when_not_matched_insert_all()
    .execute(new_data)
)
dataset.to_table().to_pandas()
```
```
   id    a  c
0   0  1.0  x
1   1  2.0  x
2   2  3.0  y
3   3  NaN  y
4   4  NaN  z
```
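The two modes shown above can be modeled in plain Python to make the row-level semantics explicit. This is a sketch over lists of dicts, not Lance's `merge_insert` implementation: the `merge_insert_sketch` function and its `update_matched` flag are hypothetical, and `None` stands in for the `NaN` that pandas displays.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def merge_insert_sketch(
    rows: List[Row], new_rows: List[Row], on: str, columns: List[str], update_matched: bool
) -> List[Row]:
    by_key = {row[on]: dict(row) for row in rows}
    for new in new_rows:
        key = new[on]
        if key in by_key:
            if update_matched:
                # Only the supplied subcolumns are overwritten.
                by_key[key].update(new)
        else:
            # New row: columns that were not supplied become null.
            by_key[key] = {col: new.get(col) for col in columns}
    return list(by_key.values())


columns = ["id", "a", "c"]
rows = [{"id": i, "a": float(i + 1), "c": "x"} for i in range(3)]

# Upsert: update matched rows and insert unmatched ones.
rows = merge_insert_sketch(
    rows, [{"id": 2, "c": "y"}, {"id": 3, "c": "y"}], "id", columns, update_matched=True
)
# Insert-if-not-exists: matched rows are left untouched.
rows = merge_insert_sketch(
    rows, [{"id": 3, "c": "z"}, {"id": 4, "c": "z"}], "id", columns, update_matched=False
)
for row in rows:
    print(row)
```

The final rows mirror the second table above: id 2 keeps `a = 3.0` with `c = "y"`, id 3 keeps `c = "y"` because it already existed, and id 4 is inserted with a null `a`.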
Labels
enhancement New feature or request python
Development

Successfully merging this pull request may close these issues.

Allow inserting subschemas
3 participants